
meteharun: Implemented CUDA matrix multiplication#27

Open
meteharun wants to merge 1 commit into parallelcomputingabo:main from meteharun:meteharun


CUDA Matrix Multiplication - Submission by meteharun

What I did

  • Implemented two CUDA kernels:
    • naive_cuda_matmul: a basic version that multiplies matrices without any optimization.
    • tiled_cuda_matmul: an improved version using shared memory and tiling to make it faster.
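For readers who have not seen the baseline approach, a naive kernel of this kind typically assigns one thread per output element and reads every operand straight from global memory. The sketch below is illustrative only and is not the code in this PR: the names `naive_matmul_sketch` and `run_naive_sketch`, the row-major layout, the `float` element type, and the 16x16 block size are all assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch (not the PR's kernel): C (m x p) = A (m x n) * B (n x p),
// all matrices stored row-major as float.
__global__ void naive_matmul_sketch(const float* A, const float* B, float* C,
                                    int m, int n, int p) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < p) {
        float sum = 0.0f;
        // Every operand is fetched from global memory on each iteration.
        for (int k = 0; k < n; ++k) {
            sum += A[row * n + k] * B[k * p + col];
        }
        C[row * p + col] = sum;
    }
}

// Hypothetical host helper: copies inputs to the device, launches the kernel,
// and copies the result back. Error checking is omitted for brevity.
void run_naive_sketch(const float* hA, const float* hB, float* hC,
                      int m, int n, int p) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * (size_t)m * n);
    cudaMalloc(&dB, sizeof(float) * (size_t)n * p);
    cudaMalloc(&dC, sizeof(float) * (size_t)m * p);
    cudaMemcpy(dA, hA, sizeof(float) * (size_t)m * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(float) * (size_t)n * p, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // assumed block size
    dim3 grid((p + block.x - 1) / block.x, (m + block.y - 1) / block.y);
    naive_matmul_sketch<<<grid, block>>>(dA, dB, dC, m, n, p);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hC, dC, sizeof(float) * (size_t)m * p, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

Each thread here performs 2n global loads for n multiply-adds, which is the memory traffic that tiling is meant to reduce.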

Optimizations

  • Used shared memory to store tiles of A and B, which reduces the number of slow global memory accesses.
  • Applied tiling (block-level matrix multiplication), so threads in the same block work together on small submatrices.
  • Added proper synchronization between threads to make sure shared memory is used correctly.
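The three points above can be sketched as a single kernel. This is a minimal illustration under stated assumptions, not the PR's implementation: the names `tiled_matmul_sketch` and `run_tiled_sketch`, the tile width `TILE = 16`, the row-major `float` layout, and the zero-padding of partial tiles are all choices made for this example.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; the PR does not state the value it uses

// Hypothetical sketch: C (m x p) = A (m x n) * B (n x p), row-major floats.
// Must be launched with TILE x TILE thread blocks.
__global__ void tiled_matmul_sketch(const float* A, const float* B, float* C,
                                    int m, int n, int p) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; out-of-range slots are zero.
        As[threadIdx.y][threadIdx.x] = (row < m && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < p) ? B[bRow * p + col] : 0.0f;
        __syncthreads();  // both tiles must be fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // nobody may overwrite the tiles while others still read
    }

    if (row < m && col < p) {
        C[row * p + col] = sum;
    }
}

// Hypothetical host helper; error checking is omitted for brevity.
void run_tiled_sketch(const float* hA, const float* hB, float* hC,
                      int m, int n, int p) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * (size_t)m * n);
    cudaMalloc(&dB, sizeof(float) * (size_t)n * p);
    cudaMalloc(&dC, sizeof(float) * (size_t)m * p);
    cudaMemcpy(dA, hA, sizeof(float) * (size_t)m * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(float) * (size_t)n * p, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((p + TILE - 1) / TILE, (m + TILE - 1) / TILE);
    tiled_matmul_sketch<<<grid, block>>>(dA, dB, dC, m, n, p);

    cudaMemcpy(hC, dC, sizeof(float) * (size_t)m * p, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

In this layout each element of A and B is read from global memory once per tile and then reused by up to `TILE` threads from shared memory. The two `__syncthreads()` calls cannot be placed inside the bounds check, because every thread in the block has to reach them.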

Challenges

  • CUDA compatibility issues on Puhti.
  • Declaring variables such as cudaEvent_t start a second time when timing both kernels caused compilation errors, so the reuse had to be avoided.
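One possible way to sidestep the redeclaration problem is to keep the event variables local to a small timing helper, so each measurement gets its own scope. This is only a sketch and may differ from what the PR does: `time_launch_ms`, `scale_kernel`, and `run_timing_demo_ms` are hypothetical names introduced for the example.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel so the example is self-contained.
__global__ void scale_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= 2.0f;
    }
}

// Hypothetical helper: times whatever `launch` enqueues on the default stream.
// `start` and `stop` live only inside this function, so calling it for several
// kernels never redeclares them in the caller's scope.
template <typename Launch>
float time_launch_ms(Launch launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Usage: the same helper is called twice without any name clash.
float run_timing_demo_ms() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, sizeof(float) * n);
    cudaMemset(d, 0, sizeof(float) * n);

    float first  = time_launch_ms([&] { scale_kernel<<<(n + 255) / 256, 256>>>(d, n); });
    float second = time_launch_ms([&] { scale_kernel<<<(n + 255) / 256, 256>>>(d, n); });

    cudaFree(d);
    return first + second;
}
```

Wrapping each measurement in its own `{ ... }` block, or using distinct names such as `start_naive` and `start_tiled`, would resolve the same compile error without a helper.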

Results

  • Added a table to README.md with the timing results for both CUDA versions and the speedup of the tiled version over both the naive kernel and the CPU version.
